73 research outputs found

    NPGREAT: Hybrid Assembly of Human Subtelomeres with the Use of Nanopore and Linked-Read Datasets

    Telomeres are vitally important regions located at the tips of the chromosomes. Their dysfunction, caused by length shortening, can lead to senescent cells, which in turn cause age-related diseases, including cancer. The subtelomeres, located next to the telomeres, play the critical role of regulating the adjacent telomere lengths. Even after many years of research, human subtelomeres have proven very hard to assemble because of their morphology. To overcome these problems, the hybrid assembly method we develop uses two of the latest available types of data, which complement each other: Linked-Reads and ultralong Nanopore reads. Our strategy is to first use the single-copy region adjacent to a telomere to search for the linked-read and nanopore read datasets that correspond to the subtelomere region in question. Next, we use the REXTAL (Regional Extension of Assemblies Using Linked-Reads) method to create the set of short-read assemblies derived from the selected linked-reads. We develop the NanoPore Guided Regional Assembly Tool (NPGREAT), which assembles the short-read REXTAL assemblies and the selected ultralong reads. In NPGREAT, the ultralong Nanopore reads are used as scaffolds upon which the REXTAL contigs can be placed and corrected, replacing the low-quality Nanopore sequence with high-quality REXTAL sequence for matching regions. In the regions that lack REXTAL coverage, we retain the Nanopore sequence as "connectors", useful for spacing, orienting and ordering multiple REXTAL contigs. The output is a single sequence. We tested NPGREAT on the NA12878 human subtelomeric regions. The output assemblies have high percent identity with the hg38 reference, with differences only in the variable tandem-repeat regions of the sequence. The hybrid NPGREAT method provides, for the first time, a high-quality continuous assembly of human subtelomeric regions.
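
The scaffold-and-connector idea in this abstract can be illustrated with a toy sketch. This is not the NPGREAT implementation: the function names (`revcomp`, `stitch`) and the placement tuple format are hypothetical, and the contig coordinates are assumed to come from an alignment step that is not shown. Matching regions take the contig sequence, and uncovered regions keep the long-read sequence as connectors.

```python
# Toy sketch only; names and the placement format are assumptions, not NPGREAT's API.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def stitch(long_read, placements):
    """Combine one long read with contigs placed on it.

    placements: list of (start, end, contig_seq, strand) in long-read
    coordinates (end exclusive), assumed to come from a prior alignment.
    """
    pieces, cursor = [], 0
    for start, end, contig, strand in sorted(placements, key=lambda p: p[0]):
        if start < cursor:
            raise ValueError("overlapping placements are not handled in this sketch")
        pieces.append(long_read[cursor:start])  # connector from the long read
        pieces.append(contig if strand == "+" else revcomp(contig))  # oriented contig
        cursor = end
    pieces.append(long_read[cursor:])  # trailing connector
    return "".join(pieces)
```

Sorting by start coordinate gives the contig order, and the strand flag gives the orientation; a real pipeline would also have to resolve overlaps and size differences in tandem repeats.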

    Nanopore Guided Regional Assembly

    The telomeres are the "caps" of the chromosomes and their vital role is to protect them. Telomere dysfunction caused by telomere rearrangements can be fatal for the cell and result in age-related diseases, including cancer. The telomeres and subtelomeres are regions that are hard to investigate. Current technology cannot provide their complete sequence; instead, the DNA is given in multiple pieces. Current methods of assembling the pieces of these regions are not accurate enough because of the region's high variability and complex repeated patterns. We propose a hybrid assembly method, NPGREAT, which uses two of the latest available types of data: Linked-Reads and ultralong Nanopore reads. It consists of five main steps. (i) Input selection of the data. (ii) Orientation, Order and Enhanced Correction of the short contigs, using the long reads as scaffolds onto which the short contigs are mapped. In particular, the Enhanced Correction step allows for the correction of potential misassemblies within the short contigs due to deletions in tandem repeat regions. The nanopore sequence is used to fill the missing portion, representing the tandem repeat region accurately, a region that is highly variable from one human to another. (iii) In the Region Extraction step, the segments of the multiple long reads that can be used to connect the short contigs are extracted. (iv) In the Gap Filling step, all possible segments are taken into account and one is selected to fill each gap. (v) Finally, in the Combination step, the corrected short pieces are combined with the connector segments. The output is the subtelomere region of the chromosome. NPGREAT is evaluated with the QUAST tool, and the resulting assemblies are of high quality.

    NPGreat: Assembly of the Human Subtelomere Regions with the Use of Ultralong Nanopore Reads and Linked Reads

    Background: Human subtelomeric DNA regulates the length and stability of adjacent telomeres that are critical for cellular function, and contains many gene/pseudogene families. Large evolutionarily recent segmental duplications and associated structural variation in human subtelomeres have made complete sequencing and assembly of these regions difficult to impossible for many loci, complicating or precluding a wide range of genetic analyses to investigate their function. Results: We present a hybrid assembly method, NanoPore Guided REgional Assembly Tool (NPGREAT), which combines Linked-Read data with ultralong nanopore reads spanning subtelomeric segmental duplications to potentially overcome these difficulties. Linked-Read sets identified by matches with 1-copy subtelomere sequence adjacent to segmental duplications are assembled and extended into the segmental duplication regions using Regional Extension of Assemblies using Linked-Reads (REXTAL). Telomere-containing ultralong nanopore reads are then used to provide contiguity and correct orientation for matching REXTAL sequence contigs as well as identification/correction of any misassemblies (associated primarily with tandem repeats). While we focus on subtelomeres, the method is generally applicable to assembly of segmental duplications and other complex genome regions. Our method was tested for a subset of representative subtelomeres with ultralong nanopore read coverage in GM12878. 10X Linked-Read datasets with high depth of coverage and a TELL-seq Linked-Read dataset with lower depth of coverage were each combined with the ultralong nanopore reads from the same genome to provide improved assemblies. Tandem repeat regions of the short-read assemblies, which are especially prone to misassembly due to collapse of matching tandemly repeated reads, were readily identified and properly sized by comparison with the nanopore reads.
Conclusion: The NPGREAT method resulted in extension of high-quality assemblies into otherwise inaccessible segmental duplication regions near telomeres, enhancing our ability to accurately assemble human subtelomere DNA. This information will enable improved analyses of the structure, function, and evolution of these key regions.

    Near Real-Time Probabilistic Damage Diagnosis Using Surrogate Modeling and High Performance Computing

    This work investigates novel approaches to probabilistic damage diagnosis that utilize surrogate modeling and high performance computing (HPC) to achieve substantial computational speedup. Motivated by Digital Twin, a structural health management (SHM) paradigm that integrates vehicle-specific characteristics with continual in-situ damage diagnosis and prognosis, the methods studied herein yield near real-time damage assessments that could enable monitoring of a vehicle's health while it is operating (i.e. online SHM). High-fidelity modeling and uncertainty quantification (UQ), both critical to Digital Twin, are incorporated using finite element method simulations and Bayesian inference, respectively. The crux of the proposed Bayesian diagnosis methods, however, is the reformulation of the numerical sampling algorithms (e.g. Markov chain Monte Carlo) used to generate the resulting probabilistic damage estimates. To this end, three distinct methods are demonstrated for rapid sampling that utilize surrogate modeling and exploit various degrees of parallelism for leveraging HPC. The accuracy and computational efficiency of the methods are compared on the problem of strain-based crack identification in thin plates. While each approach has inherent problem-specific strengths and weaknesses, all approaches are shown to provide accurate probabilistic damage diagnoses and several orders of magnitude computational speedup relative to a baseline Bayesian diagnosis implementation
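
As a rough illustration of sampling against a surrogate instead of an expensive simulation, the sketch below runs a plain Metropolis sampler on a one-parameter toy problem. Everything in it is an assumption made for illustration: `expensive_model` is a quadratic stand-in for a finite element solve, the surrogate is simple piecewise-linear interpolation, and the prior, noise level, and step size are arbitrary. It does not reproduce any of the three methods the abstract refers to.

```python
import bisect
import math
import random


def expensive_model(theta):
    """Hypothetical stand-in for a costly simulation (not a real FE model)."""
    return theta ** 2


def build_surrogate(model, lo, hi, n):
    """Tabulate the model on n grid points and interpolate linearly between them."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ys = [model(x) for x in xs]

    def surrogate(theta):
        j = min(max(bisect.bisect_right(xs, theta), 1), n - 1)
        t = (theta - xs[j - 1]) / (xs[j] - xs[j - 1])
        return ys[j - 1] + t * (ys[j] - ys[j - 1])

    return surrogate


def metropolis(log_post, start, step, n_samples, rng):
    """Random-walk Metropolis sampler for a scalar parameter."""
    theta, lp = start, log_post(start)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        lp_prop = log_post(proposal)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = proposal, lp_prop
        samples.append(theta)
    return samples


def posterior_samples(y_obs, sigma, lo=0.0, hi=5.0, seed=0):
    """Sample theta given one observation, using the surrogate in the likelihood."""
    surrogate = build_surrogate(expensive_model, lo, hi, 51)  # 51 model runs, done once

    def log_post(theta):
        if not lo <= theta <= hi:  # uniform prior on [lo, hi]
            return float("-inf")
        return -0.5 * ((surrogate(theta) - y_obs) / sigma) ** 2

    chain = metropolis(log_post, 1.0, 0.1, 20000, random.Random(seed))
    return chain[2000:]  # discard burn-in
```

The expensive model is called only while building the table, so each of the 20000 likelihood evaluations costs one interpolation. Independent chains with different seeds could run in parallel, which is one simple way such a sampler maps onto HPC resources.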

    A Dynamic Programming Algorithm for Finding the Optimal Placement of a Secondary Structure Topology in Cryo-EM Data

    The determination of secondary structure topology is a critical step in deriving the atomic structures from the protein density maps obtained from the electron cryomicroscopy technique. This step often relies on matching the secondary structure traces detected from the protein density map to the secondary structure sequence segments predicted from the amino acid sequence. Due to inaccuracies in both sources of information, a pool of possible secondary structure positions needs to be sampled. One way to approach the problem is to first derive a small number of possible topologies using existing matching algorithms, and then find the optimal placement for each possible topology. We present a dynamic programming method of Θ(N q^2 h) to find the optimal placement for a secondary structure topology. We show that our algorithm requires significantly less computational time than the brute force method, which is in the order of Θ(q^N h).
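
The gap between a polynomial dynamic program and exponential enumeration can be seen in a generic chain-structured sketch. This is an assumed simplification, not the paper's algorithm: it supposes N elements in a fixed order, q candidate placements per element, and a cost that decomposes into per-element terms plus terms between consecutive elements. The names `unary`, `pair_cost`, `best_placement_dp`, and `best_placement_brute` are hypothetical.

```python
from itertools import product


def best_placement_dp(unary, pair_cost):
    """Minimise total cost over one candidate per element, in O(N * q^2) cost evaluations.

    unary[i][a]        cost of candidate a for element i
    pair_cost(i, a, b) cost between candidate a of element i and candidate b of element i + 1
    """
    best = list(unary[0])  # best[a]: cheapest cost of a prefix ending in candidate a
    back = []              # back-pointers for recovering the optimal choices
    for i in range(1, len(unary)):
        new_best, pointers = [], []
        for b, cost_b in enumerate(unary[i]):
            options = [best[a] + pair_cost(i - 1, a, b) for a in range(len(best))]
            a_star = min(range(len(options)), key=options.__getitem__)
            new_best.append(options[a_star] + cost_b)
            pointers.append(a_star)
        best = new_best
        back.append(pointers)
    last = min(range(len(best)), key=best.__getitem__)
    total, path = best[last], [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return total, path


def best_placement_brute(unary, pair_cost):
    """Reference answer by enumerating all q^N combinations."""
    n = len(unary)
    best_total, best_path = float("inf"), None
    for path in product(*(range(len(u)) for u in unary)):
        total = sum(unary[i][path[i]] for i in range(n))
        total += sum(pair_cost(i, path[i], path[i + 1]) for i in range(n - 1))
        if total < best_total:
            best_total, best_path = total, list(path)
    return best_total, best_path
```

Both functions return the same minimum cost, but the enumeration visits every combination while the dynamic program reuses the best prefix cost for each candidate.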

    ISQuest: Finding Insertion Sequences in Prokaryotic Sequence Fragment Data

    Motivation: Insertion sequences (ISs) are transposable elements present in most bacterial and archaeal genomes that play an important role in genomic evolution. The increasing availability of sequenced prokaryotic genomes offers the opportunity to study ISs comprehensively, but development of efficient and accurate tools is required for discovery and annotation. Additionally, prokaryotic genomes are frequently deposited at an incomplete or draft stage because of the substantial cost and effort required to finish genome assembly projects. Development of methods to identify ISs directly from raw sequence reads or draft genomes is therefore desirable. Software tools such as Optimized Annotation System for Insertion Sequences and IScan currently identify IS elements in completely assembled and annotated genomes; however, to our knowledge no methods have been developed to identify ISs from raw fragment data or partially assembled genomes. We have developed novel methods to solve this computationally challenging problem, and implemented these methods in the software package ISQuest. This software identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS elements. We tested ISQuest on simulated read libraries of 3810 complete bacterial genomes and plasmids in GenBank and were able to detect 82% of the ISs and transposases annotated in GenBank with 80% sequence identity.

    A Single Thread to Fortran Coarray Transition Process for the Control Algorithm in the Space Radiation Code HZETRN

    Exa-scale computing is the direction in which industry and government are heading to generate solutions to problems they deem necessary. Computing hardware is being developed to achieve the transition from Peta-scale to Exa-scale with more CPUs (Central Processing Units) that have more cores per CPU and more accelerators (GPGPUs (General Purpose Graphics Processing Units) and MICs (Many Integrated Cores)) per node. To fully utilize the hardware available now and in the future, algorithms must become multi-threaded. There are a few methods to generate multi-threaded software, such as MPI (Message Passing Interface) and OpenMP (Multi-Processing) / OpenACC (ACCelerator). This paper concentrates on using Coarray Fortran to convert the control algorithm of the Fortran 95 based HZETRN (High Z and Energy TRaNsport) code from single-threaded to multi-threaded. The resultant Coarray code was 32.5 times faster (with a theoretical speed-up of 74.5 times) than the single-threaded version on the hardware tested, as reliable as the Fortran 95 version, and, as it uses native Fortran, as maintainable as the Fortran 95 version. The Coarray code can be maintained by the same project engineers and scientists who created the original single-threaded code. This transition process can be utilized on a C language based code with a compiler that has the UPC (Universal Parallel C) extensions to C.

    Multi-Dimensional Numerical Integration on Parallel Architectures

    Multi-dimensional numerical integration is a challenging computational problem that is encountered in many scientific computing applications. Despite extensive research and the development of efficient techniques such as adaptive and Monte Carlo methods, many complex high-dimensional integrands can be too computationally intense even for state-of-the-art numerical libraries such as CUBA, QUADPACK, NAG, and MSL. However, adaptive integration has few dependencies and is very well suited for parallel architectures where processors can operate on different partitions of the integration-space. While parallel methods exist, most are simple extensions of their sequential versions. This results in moderate speedup and, in many cases, failure to significantly surpass the precision capabilities of the sequential methods. We propose a new algorithm for adaptive multi-dimensional integration of challenging integrands for execution on highly parallel architectures. We avoid the common sequential scheme of adaptive methods in favor of a high-throughput approach better suited for parallel architectures. Experimental results show orders of magnitude speedup over sequential methods and improved performance in terms of maximum attainable precision.
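
One way to picture a high-throughput alternative to the sequential "refine the single worst region" loop is to process every active region in each round. The sketch below is an assumption-laden illustration, not the proposed algorithm: it uses the midpoint rule as a stand-in for a proper cubature rule, a simple volume-proportional tolerance, and hypothetical names (`midpoint`, `split`, `integrate_batched`). It runs serially, but the per-region work inside each round is independent and is the part a parallel machine could batch.

```python
from itertools import product


def midpoint(f, lo, hi):
    """Midpoint-rule estimate over the box [lo, hi]; returns (estimate, volume)."""
    volume, centre = 1.0, []
    for a, b in zip(lo, hi):
        volume *= b - a
        centre.append(0.5 * (a + b))
    return volume * f(*centre), volume


def split(lo, hi):
    """Yield the 2^d sub-boxes obtained by halving every axis."""
    mids = [0.5 * (a + b) for a, b in zip(lo, hi)]
    for choice in product((0, 1), repeat=len(lo)):
        sub_lo = tuple(m if c else a for a, m, c in zip(lo, mids, choice))
        sub_hi = tuple(b if c else m for b, m, c in zip(hi, mids, choice))
        yield sub_lo, sub_hi


def integrate_batched(f, lo, hi, tol=1e-3, max_rounds=12):
    """Round-based adaptive integration: every active region is handled each round."""
    total_volume = midpoint(lambda *x: 1.0, lo, hi)[1]
    active = [(tuple(lo), tuple(hi))]
    result = 0.0
    for _ in range(max_rounds):
        if not active:
            break
        next_active = []
        for r_lo, r_hi in active:  # independent work items within a round
            coarse, volume = midpoint(f, r_lo, r_hi)
            children = list(split(r_lo, r_hi))
            fine = sum(midpoint(f, c_lo, c_hi)[0] for c_lo, c_hi in children)
            if abs(fine - coarse) <= tol * volume / total_volume:
                result += fine                # region converged: keep refined estimate
            else:
                next_active.extend(children)  # region not converged: refine next round
        active = next_active
    # Regions still active after the last round contribute their coarse estimate.
    result += sum(midpoint(f, r_lo, r_hi)[0] for r_lo, r_hi in active)
    return result
```

For example, `integrate_batched(lambda x, y: x * x + y * y, (0.0, 0.0), (1.0, 1.0))` approximates the exact value 2/3. The number of regions grows as 2^d per split, so this naive splitting is only practical in low dimensions.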